FIGURE 5.7
Overview of the algorithm proposed in [5].
In summary, this paper's contributions are as follows: (1) new kernels for efficient and accurate integer-only GELU and Softmax, in which both functions are approximated by lightweight second-order polynomials that can be evaluated with integer-only arithmetic; (2) integer-only LayerNorm computation, obtained by leveraging a known algorithm for the integer calculation of the square root [49]; and (3) fully integer-only quantization of language models built on the proposed approximations of GELU and Softmax together with the integer-only LayerNorm.
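As a concrete illustration of contribution (1), the following floating-point sketch shows the functional form of the second-order polynomial GELU approximation. The coefficients a ≈ −0.2888 and b ≈ −1.769 are the values reported for this approximation and are reproduced here only for illustration; the actual kernel evaluates the same polynomial with integer arithmetic and scaling factors, which are omitted, and the function names are ours.

```python
import math

# Coefficients of the second-order polynomial fit to erf used by the
# integer-only GELU kernel; reproduced here for illustration only.
A, B = -0.2888, -1.769

def poly_erf(x: float) -> float:
    """Second-order polynomial approximation of erf(x)."""
    sign = 1.0 if x >= 0 else -1.0
    x_clipped = min(abs(x), -B)      # saturate outside the fitted range
    return sign * (A * (x_clipped + B) ** 2 + 1.0)

def poly_gelu(x: float) -> float:
    """GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))), with erf replaced by poly_erf."""
    return 0.5 * x * (1.0 + poly_erf(x / math.sqrt(2.0)))

# Sanity check against the exact GELU.
for v in (-3.0, -1.0, 0.0, 1.0, 3.0):
    exact = 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))
    print(f"x={v:+.1f}  exact={exact:+.4f}  approx={poly_gelu(v):+.4f}")
```

Because the polynomial is only second order, it can be evaluated with a handful of multiplications and additions once inputs and coefficients are expressed in fixed point, which is what makes the kernel integer-only.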
5.5 Toward Efficient Post-Training Quantization of Pre-Trained Language Models
Bai et al. [5] propose MREM, which aims to improve the performance of post-training quantization for language models while retaining the training efficiency, low memory overhead, and data accessibility that post-training quantization offers. An overview of the algorithm proposed in [5] is presented in Fig. 5.7. As can be seen, the full-precision and quantized models are first partitioned into multiple modules, which are then placed on different computing devices. Each module samples input tensors from its own input queue, so it can be trained locally without waiting for its predecessors. Moreover, teacher forcing is applied to mitigate the propagation of reconstruction errors into the quantized modules; a minimal sketch of this training scheme is given below.
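The sketch below assumes a PyTorch-style setting: the transformer layers are split into modules, and each quantized module is trained on its own device against its frozen full-precision counterpart, consuming cached full-precision hidden states from an input queue (teacher forcing). All class and function names are hypothetical, and only the module's final output is matched here; the full MREM objective over every layer in a module is given by Eq. (5.10) in Section 5.5.1.

```python
import torch
from torch import nn


def partition(layers: nn.ModuleList, num_modules: int) -> list:
    """Split consecutive transformer layers into num_modules modules."""
    per_module = (len(layers) + num_modules - 1) // num_modules
    return [nn.Sequential(*layers[i:i + per_module])
            for i in range(0, len(layers), per_module)]


class ModuleWorker:
    """Trains one quantized module locally against its full-precision counterpart."""

    def __init__(self, fp_module, q_module, device, input_queue):
        self.fp = fp_module.to(device).eval()   # frozen full-precision teacher
        self.q = q_module.to(device)            # quantized student module
        self.queue = input_queue                # cached full-precision hidden states
        self.device = device
        self.opt = torch.optim.Adam(self.q.parameters(), lr=1e-4)

    def step(self) -> float:
        # Teacher forcing: the quantized module consumes *full-precision*
        # hidden states from its queue, so reconstruction errors made by
        # earlier quantized modules do not propagate into this one, and no
        # worker has to wait for its predecessors.
        h = self.queue.get().to(self.device)
        with torch.no_grad():
            target = self.fp(h)
        loss = torch.mean((self.q(h) - target) ** 2)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```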
5.5.1 Module-Wise Reconstruction Error Minimization
First, the language model is partitioned into multiple modules, each consisting of multiple transformer layers. Module-wise reconstruction error minimization (MREM) is then used to optimize each module's weights and quantization parameters, which permits sufficient optimization. Specifically, given a language model with $L$ transformer layers, the embedding layers, and the classification head, the model is partitioned into $N$ modules. Suppose the $n$-th module contains $p$ transformer layers; then it comprises the layers $[l_j, l_{j+1}, l_{j+2}, \ldots, l_{j+p-1}]$, with $l_j$ being the first layer of this module. MREM minimizes the joint reconstruction error between each intermediate output $\hat{f}_{l_i}$ of the quantized $n$-th module and its full-precision counterpart $f_{l_i}$, as follows:
$$\mathcal{L}_n = \sum_{i=j}^{j+p-1} \left\lVert \hat{f}_{l_i} - f_{l_i} \right\rVert^2 . \tag{5.10}$$
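A short sketch of Eq. (5.10) follows, again in a PyTorch-style setting where the quantized and full-precision modules hold the same consecutive layers $l_j, \ldots, l_{j+p-1}$ and both receive the same full-precision input (teacher forcing); the function name and the summed squared error are illustrative rather than the reference implementation of [5].

```python
import torch
from torch import nn


def mrem_loss(q_module: nn.Sequential, fp_module: nn.Sequential,
              h_in: torch.Tensor) -> torch.Tensor:
    """Joint reconstruction error L_n of Eq. (5.10) for one module.

    The loss sums the squared differences between the quantized output and
    the full-precision output of every transformer layer in the module.
    """
    loss = h_in.new_zeros(())
    h_q = h_fp = h_in
    for q_layer, fp_layer in zip(q_module, fp_module):
        h_q = q_layer(h_q)                      # quantized forward pass
        with torch.no_grad():
            h_fp = fp_layer(h_fp)               # full-precision forward pass
        loss = loss + torch.sum((h_q - h_fp) ** 2)
    return loss
```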